# Day 4: Complete Guide to Transformer Architecture
The Transformer, introduced in the 2017 paper "Attention Is All You Need," is the foundation of modern LLMs. It has largely replaced RNNs and LSTMs for sequence modeling because, unlike those recurrent models, it processes all tokens in parallel, which makes large-scale training practical.
## Overall Transformer Structure
```
Input Text -> [Token Embedding + Positional Encoding]
            |
+-------------------------+
|   Encoder Block (xN)    |
| +---------------------+ |
| |   Self-Attention    | |
| |   + Residual + LN   | |
| +---------------------+ |
| |    Feed-Forward     | |
| |   + Residual + LN   | |
| +---------------------+ |
+-------------------------+
            |
+-------------------------+
|   Decoder Block (xN)    |
| +---------------------+ |
| |  Masked Self-Attn   | |
| +---------------------+ |
| |   Cross-Attention   | |
| +---------------------+ |
| |    Feed-Forward     | |
| +---------------------+ |
+-------------------------+
            |
       Output Text
```
The GPT series uses only the decoder stack, while BERT uses only the encoder stack; the original Transformer uses both, for machine translation.
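Before the blocks above run, the diagram's first step adds order information to the token embeddings. The original paper uses fixed sinusoidal positional encodings; here is a minimal sketch of that scheme (the function name `positional_encoding` is my own, not from the lesson):

```python
import numpy as np

def positional_encoding(seq_length, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    positions = np.arange(seq_length)[:, np.newaxis]           # (seq, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((seq_length, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)  # even dimensions
    pe[:, 1::2] = np.cos(positions / div_terms)  # odd dimensions
    return pe

pe = positional_encoding(seq_length=3, d_model=4)
print(pe.round(3))  # each row is a distinct signature for its position
```

This matrix is simply added to the token embeddings, giving each position a unique, deterministic signature without any learned parameters.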
## Simplified Self-Attention Implementation
```python
import numpy as np

def self_attention(query, key, value):
    """Scaled Dot-Product Attention"""
    d_k = query.shape[-1]
    # 1. Compute similarity via dot product of Query and Key
    scores = np.matmul(query, key.T) / np.sqrt(d_k)
    # 2. Normalize weights with Softmax (numerically stable form)
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    # 3. Apply weights to Values
    output = np.matmul(attention_weights, value)
    return output, attention_weights

# 3 tokens, 4-dimensional vectors
seq_length, d_model = 3, 4
x = np.random.randn(seq_length, d_model)

output, weights = self_attention(x, x, x)
print(f"Attention weights:\n{weights.round(3)}")
print(f"Output shape: {output.shape}")  # (3, 4)
```
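The decoder's masked self-attention from the diagram is the same computation with one extra step: each token's scores for later positions are set to negative infinity before the softmax, so token i can only attend to tokens 0..i. A minimal sketch (the helper name `causal_self_attention` is mine):

```python
import numpy as np

def causal_self_attention(query, key, value):
    """Scaled dot-product attention with a causal (look-ahead) mask."""
    d_k = query.shape[-1]
    scores = np.matmul(query, key.T) / np.sqrt(d_k)
    # Mask future positions: the strict upper triangle becomes -inf,
    # which the softmax turns into exactly zero weight
    seq_length = scores.shape[0]
    mask = np.triu(np.ones((seq_length, seq_length), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    return np.matmul(weights, value), weights

x = np.random.randn(3, 4)
_, w = causal_self_attention(x, x, x)
print(w.round(3))  # the upper triangle is all zeros
```

This masking is what lets GPT-style decoders be trained to predict the next token without "peeking" at it.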
## Feed-Forward Network and Layer Normalization
```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Layer Normalization: normalizes each token vector"""
    mean = np.mean(x, axis=-1, keepdims=True)
    std = np.std(x, axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise Feed-Forward Network"""
    # Expand dimensions then contract (typically 4x expansion)
    hidden = np.maximum(0, np.matmul(x, w1) + b1)  # ReLU activation
    output = np.matmul(hidden, w2) + b2
    return output

def transformer_block(x, w1, b1, w2, b2):
    """A single Transformer block"""
    # Self-Attention + Residual Connection + Layer Norm
    attn_output, _ = self_attention(x, x, x)
    x = layer_norm(x + attn_output)  # Residual connection
    # Feed-Forward + Residual Connection + Layer Norm
    ff_output = feed_forward(x, w1, b1, w2, b2)
    x = layer_norm(x + ff_output)  # Residual connection
    return x

# Initialization and execution
d_model, d_ff = 4, 16
w1 = np.random.randn(d_model, d_ff) * 0.1
b1 = np.zeros(d_ff)
w2 = np.random.randn(d_ff, d_model) * 0.1
b2 = np.zeros(d_model)

x = np.random.randn(3, d_model)
output = transformer_block(x, w1, b1, w2, b2)
print(f"Input shape: {x.shape}, Output shape: {output.shape}")
```
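One simplification in the `layer_norm` above: real LayerNorm layers also carry learnable per-feature scale (gamma) and shift (beta) parameters, so the network can rescale or even partially undo the normalization during training. A sketch of the full form (the function name `layer_norm_affine` and the conventional gamma/beta naming are my additions, not code from the lesson):

```python
import numpy as np

def layer_norm_affine(x, gamma, beta, eps=1e-6):
    """Layer normalization with learnable scale (gamma) and shift (beta)."""
    mean = np.mean(x, axis=-1, keepdims=True)
    std = np.std(x, axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

d_model = 4
gamma = np.ones(d_model)   # typical init: starts as plain normalization
beta = np.zeros(d_model)
x = np.random.randn(3, d_model)
print(layer_norm_affine(x, gamma, beta).round(3))
```

With gamma initialized to ones and beta to zeros, this behaves identically to the plain version; the parameters are then updated by gradient descent along with the rest of the model.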
## Summary of Key Components
| Component | Role | Key Idea |
|---|---|---|
| Self-Attention | Captures relationships between tokens | Every token attends to every other token |
| Feed-Forward | Non-linear transformation | Expand then contract dimensions |
| Layer Normalization | Stabilizes training | Normalizes the output of each layer |
| Residual Connection | Ensures gradient flow | Adds input to output |
| Positional Encoding | Provides token order information | Without it, word order is ignored |
The Transformer is a combination of these five components. Tomorrow we’ll dive deeper into the most critical component: the Attention mechanism.
## Today’s Exercises
- Explain why deep networks are difficult to train without Residual Connections. Relate your answer to the vanishing gradient problem.
- Stack the `transformer_block` function 6 times to build a 6-layer Transformer. Verify that the input/output shapes are preserved.
- Summarize which tasks encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) architectures are each best suited for.